The subway station data is used to find out whether there is a relationship between hostel locations and subway station locations.
NYsub_raw <- NYsub_raw %>%
mutate(
lat=str_extract(NYsub_raw$the_geom,"\\s[0-9,.]+") %>% as.numeric(),
lon=str_extract(NYsub_raw$the_geom,"-[0-9,.]+") %>% as.numeric()
)
head(NYsub_raw)
## URL OBJECTID NAME
## 1 http://web.mta.info/nyct/service/ 1 Astor Pl
## 2 http://web.mta.info/nyct/service/ 2 Canal St
## 3 http://web.mta.info/nyct/service/ 3 50th St
## 4 http://web.mta.info/nyct/service/ 4 Bergen St
## 5 http://web.mta.info/nyct/service/ 5 Pennsylvania Ave
## 6 http://web.mta.info/nyct/service/ 6 238th St
## the_geom LINE
## 1 POINT (-73.99106999861966 40.73005400028978) 4-6-6 Express
## 2 POINT (-74.00019299927328 40.71880300107709) 4-6-6 Express
## 3 POINT (-73.98384899986625 40.76172799961419) 1-2
## 4 POINT (-73.97499915116808 40.68086213682956) 2-3-4
## 5 POINT (-73.89488591154061 40.66471445143568) 3-4
## 6 POINT (-73.90087000018522 40.88466700064975) 1
## NOTES
## 1 4 nights, 6-all times, 6 Express-weekdays AM southbound, PM northbound
## 2 4 nights, 6-all times, 6 Express-weekdays AM southbound, PM northbound
## 3 1-all times, 2-nights
## 4 4-nights, 3-all other times, 2-all times
## 5 4-nights, 3-all other times
## 6 1-all times, exit only northbound
## lat lon
## 1 40.73005 -73.99107
## 2 40.71880 -74.00019
## 3 40.76173 -73.98385
## 4 40.68086 -73.97500
## 5 40.66471 -73.89489
## 6 40.88467 -73.90087
summary(NYsub_raw)
## URL OBJECTID NAME the_geom
## Length:473 Min. : 1.0 Length:473 Length:473
## Class :character 1st Qu.:119.0 Class :character Class :character
## Mode :character Median :237.0 Mode :character Mode :character
## Mean :238.1
## 3rd Qu.:355.0
## Max. :643.0
## LINE NOTES lat lon
## Length:473 Length:473 Min. :40.58 Min. :-74.03
## Class :character Class :character 1st Qu.:40.68 1st Qu.:-73.98
## Mode :character Mode :character Median :40.72 Median :-73.95
## Mean :40.73 Mean :-73.94
## 3rd Qu.:40.78 3rd Qu.:-73.90
## Max. :40.90 Max. :-73.76
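The coordinate extraction above relies on the order of the two numbers inside the `POINT (...)` string: longitude first, then latitude. As a minimal sketch of the same idea in base R, the snippet below parses a single value copied from the `head()` output. The pattern and the variable names `geom`, `parts`, `lon_one` and `lat_one` are illustrative assumptions, not part of the original analysis.

```r
# One the_geom value copied from the head() output above
geom <- "POINT (-73.99106999861966 40.73005400028978)"

# Capture the two numbers: group 1 is longitude, group 2 is latitude
parts <- regmatches(geom, regexec("POINT \\((-?[0-9.]+) (-?[0-9.]+)\\)", geom))[[1]]

lon_one <- as.numeric(parts[2])
lat_one <- as.numeric(parts[3])
```

Capturing both numbers with one pattern makes the longitude/latitude order explicit, which can help when checking the `lat` and `lon` columns against the raw strings.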
# price (extreme value)
ggplot(airbnb,mapping=aes(price)) +
geom_density(kernel = "gaussian")+
theme_classic() +
theme(legend.position="top")+
ggtitle("Price distribution (All data)")
# review
ggplot(airbnb,mapping=aes(number_of_reviews)) +
geom_histogram(binwidth = 5)+
theme_classic() +
theme(legend.position="top")+
ggtitle("Review distribution (All data)")
As seen in the plots "Price distribution (All data)" and "Review distribution (All data)", the ranges of price and number of reviews are large, and there are many extreme values.
# map
pal <- colorFactor(palette = "Dark2",domain=airbnb$neighbourhood_group)
ab_map <- leaflet() %>%
setView(lng = -73.9, lat = 40.73, zoom = 10) %>%
addProviderTiles(providers$Esri.OceanBasemap) %>%
addCircleMarkers(data=airbnb,
lng=~longitude,
lat=~latitude,
popup = ~name,
radius=2,
color=~pal(neighbourhood_group),
stroke=FALSE,
fillOpacity = 0.5
) %>%
addCircleMarkers(data=NYsub_raw,
lng=~lon,
lat=~lat,
popup = ~NAME,
radius=1,
color="black",
stroke=1,
fillOpacity = 1
)%>%
addLegend(data=airbnb,"bottomright", pal =pal,
values = ~neighbourhood_group,
title = "NYC Airbnb Location<br>by Neighbourhood",
opacity = 1
) %>%
addLegend(data=NYsub_raw,"topright",
colors =c("#000000"),
labels= c("Subway station"),
title= "NYC Subway Locations",
opacity = 1)
ab_map
renamed_cor<-airbnb %>% rename("ID"=id, "Name"=name, "Host ID"=host_id, "Host Name"=host_name, "Neighbourhood Group"=neighbourhood_group,
"Neighbourhood"=neighbourhood,"Latitude"=latitude, "Longitude"=longitude, "Room Type"=room_type,
"Price"=price, "Minimum Nights"=minimum_nights, "No. of Reviews"=number_of_reviews, "Last Review"=last_review,
"Reviews Per Month"=reviews_per_month,"No. of Listings"=calculated_host_listings_count, "Availability"=availability_365)
airbnb_cor <- renamed_cor[, sapply(renamed_cor, is.numeric)]
airbnb_cor <- airbnb_cor[complete.cases(airbnb_cor), ]
correlation_matrix <- cor(airbnb_cor, method = "spearman")
corrplot(correlation_matrix, method = "square",order = "alphabet",tl.cex =0.7,tl.col = "black",tl.srt = 45,cl.cex=0.7)
Price and the location of the stay have the strongest negative relationship, marked in dark orange. Availability is positively related to both reviews per month and the number of listings per host. One possible reason is that a host with many listings offers a better stay and is more accessible to travellers.
ggplot(airbnb,aes(x = neighbourhood_group)) +
geom_bar(aes(fill= neighbourhood_group))+
scale_fill_manual(values=c("#FFFF00", "#66CC33", "#006666", "#003366", "#660033"))+
geom_text(stat = 'count',aes(label =..count.., vjust=-0.3))+
labs(title="Number of Listings vs Neighbour Group", x="Neighbourhood Group", y = "Number of listings")+
theme_minimal()
Among the neighbourhood groups, Manhattan and Brooklyn have the most listings, with 21,661 and 20,104 respectively, while Staten Island has the fewest, with only 373.
review_pie<-airbnb%>%group_by(neighbourhood_group)%>% summarise(Total_review1 =sum(number_of_reviews, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
sumreview <- sum(review_pie$Total_review1)
review_pie$Total_review <- review_pie$Total_review1 * 100 / sumreview
ggplot(review_pie, aes(x = "", y = Total_review, fill = neighbourhood_group)) +
geom_bar(width = 1, stat = "identity")+
coord_polar("y", start = 0)+
scale_fill_brewer(palette = "Blues")+
geom_text(aes(label = paste0(round(Total_review), "%")), position = position_stack(vjust = 0.5), size=3)+
labs(title="Total Reviews vs Neighbour Group", x=element_blank(), y=element_blank())
Summing the number of reviews by neighbourhood group shows that Brooklyn (43%), Manhattan (40%) and Queens (14%) received the most reviews, in that order. This suggests that Brooklyn is the most popular region for Airbnb stays.
circular_bar<-airbnb%>%group_by(neighbourhood,neighbourhood_group)%>% summarise(Total_review =sum(number_of_reviews, na.rm = TRUE))
## `summarise()` regrouping output by 'neighbourhood' (override with `.groups` argument)
circular_bar<-cbind(circular_bar, id=c(1:221))
circular_bar<-arrange(circular_bar,desc(Total_review))
circular_bar<-circular_bar[1:30, ]
circular_bar<-cbind(circular_bar, id2=c(1:30))
circular_bar<-arrange(circular_bar,desc(id))
label_data <- circular_bar
number_of_bar <- nrow(label_data)
angle <-90 - 360 * (label_data$id2-0.5) /number_of_bar
label_data$hjust<-ifelse(angle < -90, 1, 0)
label_data$angle<-ifelse(angle < -90, angle+180, angle)
ggplot(circular_bar, aes(x=as.factor(id), y=log2(Total_review), fill=neighbourhood_group)) +
geom_bar(stat="identity", alpha=0.5) +
ylim(-5,30) +
theme_minimal() +
theme(
axis.text = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
plot.margin = unit(rep(-1,4), "cm")
) +
coord_polar(start = 0)+
geom_text(data=label_data, aes(x=id2, y=log2((Total_review))+3.5,
label=neighbourhood, hjust=hjust),
color="black",fontface="bold",alpha=0.8, size=2.6,
angle= label_data$angle, inherit.aes = FALSE )
Brooklyn and Manhattan received more reviews than Queens and the other two groups. For instance, neighbourhoods such as Williamsburg in Brooklyn and Washington Heights in Manhattan stand out for their large numbers of reviews.
#Facets of Price vs Nos. of Listings of 5 neighbourhood_group
ggplot(data = airbnb) +
geom_point(mapping = aes(x = price, y = calculated_host_listings_count), colour = "#F38434") +
facet_wrap(~ neighbourhood_group, nrow = 2)+
labs(title="Number of listings vs Price per Neighbour Group", x="Price", y = "Number of listings")
Faceted charts (facet_wrap) are used to show the relation between price and number of listings in each neighbourhood group. Manhattan shows the widest spread of prices, reflecting pricier stays in that region. By contrast, listings in the Bronx and Staten Island are much cheaper, possibly because these are less sought-after parts of New York.
ggplot(airbnb, aes(x=neighbourhood_group, y=log10(price), fill=neighbourhood_group)) +
geom_jitter(aes(colour=neighbourhood_group, alpha=0.5)) +
geom_boxplot(alpha=0.3, outlier.colour = "black", outlier.shape = 1, notch = TRUE) +
theme(legend.position="none")+
labs(title="Price vs Neighbour Group", x="Neighbourhood Group", y = "Price (log10)")
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).
The mean, median, and first and third quartiles of hostel price in Manhattan are clearly higher than in the other groups, which is consistent with the previous chart. This may be attributed to the higher cost of living in Manhattan, the core district of New York.
airbnb %>%
ggplot(aes(x=neighbourhood_group ,fill =room_type))+
labs(title = "Proportion of room type in different neighbourhood group",
x = 'Neighbourhood group',
y = 'Proportion')+
geom_bar(position = 'fill')+
theme_classic()
The higher hostel prices in Manhattan can be explained by its larger share of entire homes compared with the other neighbourhood groups.
airbnb %>% group_by(room_type) %>%ggplot(aes(x=room_type))+
labs(title = "Distribution of room type",
x = "Room Type",
y= "Count")+
geom_bar(aes(fill=room_type),fill=colors)
Hosts offer three room types. About half of the listings are Entire home/apt, and Shared room is the least common.
ylim1 = boxplot.stats(airbnb$price)$stats[c(1, 5)]
airbnb%>%
group_by(room_type)%>%
ggplot(aes(x=room_type, y=price),fill=colors)+
labs(title = "Comparison of price among room types",
x= "Room Type",
y= "Price")+
geom_violin(aes(fill=room_type))+
scale_fill_manual(values=colors)+
coord_cartesian(ylim = ylim1)+
theme_fivethirtyeight()+
theme(axis.title = element_text())
Prices for Entire home/apt are more widely spread and also higher, whereas Shared room is the cheapest of the three.
airbnb%>%
ggplot(aes(x=availability_365))+
labs(title = "Availability of all hostel",
y="Frequency",
x="Number of days"
)+
stat_bin(geom = "path" , pad = FALSE)+
theme_fivethirtyeight()+
theme(axis.title = element_text())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The graph shows that most hostels are available for fewer than 100 days a year. There is also a large spike at zero availability (availability_365 = 0), about 39% in the graph, which is worth a deeper investigation.
colors=c("#F3BD5E", "#164597")
fig <- combinedAvailability %>% plot_ly(labels=~name, values = ~count,marker = list(colors = colors,
textposition = 'inside',
textinfo = 'label+percent',
insidetextfont = list(color = '#FFFFFF'),
hoverinfo = 'text'
))
fig <- fig %>% add_pie(hole = 0.45, showlegend=TRUE)
fig <- fig %>% layout(title = "Distribution of hostel with zero Availability", showlegend = T,
xaxis = list(showgrid = TRUE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = TRUE, zeroline = FALSE, showticklabels = FALSE))
fig
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
Most hostels are available for at least one day (64.1%), but those with zero availability (35.9%) also make up a significant share.
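The `combinedAvailability` table used for the donut chart is not defined in this section. A minimal sketch of how such a two-row table could be built is shown below, using a small made-up vector in place of `airbnb$availability_365`. The labels, the toy values and the `share` column are assumptions for illustration only.

```r
# Toy stand-in for airbnb$availability_365 (values are made up)
availability_365 <- c(0, 0, 12, 365, 90, 0, 200, 45)

is_zero <- availability_365 == 0

# Two-row table with the columns the pie chart expects: name and count
combinedAvailability <- data.frame(
  name  = c("Zero availability", "Available at least one day"),
  count = c(sum(is_zero), sum(!is_zero)),
  stringsAsFactors = FALSE
)

# Percentage share of each group
combinedAvailability$share <- round(
  100 * combinedAvailability$count / sum(combinedAvailability$count), 1
)
```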
colors <- c('#0000cc','#00b35c', '#ffcf66')
fig <- plot_ly(roomtype, labels = ~room_type, values = ~count, type = 'pie',
textposition = 'inside',
textinfo = 'label+percent',
insidetextfont = list(color = '#FFFFFF'),
hoverinfo = 'text',
text = ~paste(count, 'listings'),
marker = list(colors = colors,
line = list(color = '#FFFFFF', width = 1.5)),
#The 'pull' attribute can also be used to create space between the sectors
showlegend = FALSE)
fig <- fig %>% layout(title = 'Room type distribution of zero availability hotels',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig
Consistent with the overall picture, the room type shares are similar: about half are Entire home/apt (50.6%), followed by Private room (47.7%), with Shared room the least common (1.69%).
p <- ggplot(zero,aes(x=room_type, y=log10(price)), fill='neighbourhood_group') +
geom_jitter(aes(colour=neighbourhood_group, alpha=0.5)) +
geom_hline(yintercept=0) +
labs(title = "Price distribution of ZERO availability by different room types and neighbourhood area",
x = "Room Type",
y = "Price(log)")
fig <- ggplotly(p,width = 800, height = 600)
fig
The distribution of price across room types for zero-availability hostels shows that staying in Manhattan is considerably pricier than in other regions. For shared rooms, however, no clear price pattern is visible.
fig <- plot_ly(zero, x = ~price, y = ~number_of_reviews, z = ~reviews_per_month,
color = ~room_type, colors= c('#f7a76e', '#6BA1C4','#ff0066'),
width = 800, height = 600)
fig <- fig %>% add_markers()
fig <- fig %>% layout(scene = list(xaxis = list(title = 'price'),
yaxis = list(title = 'number_of_reviews'),
zaxis = list(title = 'reviews_per_month')),
annotations = list(
x = 1.13,
y = 1.05,
showarrow = FALSE
))
fig
## Warning: Ignoring 4841 observations
The chart suggests that more expensive hostels tend to have few or no reviews, while competitively priced hostels are more popular and attract more reviews. Price therefore appears to be a key factor in whether customers write a review. Ranking the hostels in descending order of number of reviews, the average price of the top 100 hostels is $89, while that of the bottom 100 is $116. Private room dominates the market, as it has more reviews than the other two room types. One possible explanation for zero availability is that some rooms are booked through the hostel's own reservation system, so those hostels are always full and need not be listed online for rental. This information helps hotel managers and owners understand the market and make strategic plans.
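The code behind the top-100 and bottom-100 averages is not shown. A minimal sketch of one way to compute such a comparison follows, using a small made-up data frame and `k = 2` instead of 100. The function name `mean_price_by_review_rank` and the toy values are hypothetical.

```r
# Rank listings by number of reviews, then average the price of the
# first k and last k rows. Hypothetical helper, not the original code.
mean_price_by_review_rank <- function(data, k) {
  ranked <- data[order(data$number_of_reviews, decreasing = TRUE), ]
  c(top = mean(head(ranked$price, k)),
    bottom = mean(tail(ranked$price, k)))
}

# Toy data (made up) with the two columns the helper needs
toy <- data.frame(
  number_of_reviews = c(120, 3, 45, 0, 80, 1),
  price             = c(70, 150, 90, 200, 85, 110)
)

res <- mean_price_by_review_rank(toy, k = 2)
```

On the real data the call would be along the lines of `mean_price_by_review_rank(zero, k = 100)`, assuming `zero` holds the zero-availability listings. Ties in review count mean that the exact rows in the bottom group depend on the sort order.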
# NYC Airbnb
NYab <- airbnb %>% select(-c(host_id,host_name,
last_review,
reviews_per_month,
calculated_host_listings_count))
# only take Q3-Q1 data in NYab and remove Staten Island data
sum_price<-summary(NYab$price)
NYab <- NYab %>%
filter(price<=sum_price[5] & price >=sum_price[2] & neighbourhood_group!="Staten Island")
# Price distribution
ggplot(NYab,mapping=aes(price)) +
geom_histogram(binwidth = 5,color="black", fill="lightblue")+
scale_color_grey() +
theme_classic() +
theme(legend.position="top")+
ggtitle("Price distribution (Only IQR)")
# NYC subway
NYsub <- NYsub_raw %>% select(-c(URL,
the_geom))
# Initialize extra col
NYab <- NYab %>% mutate(station_dis=rep(0,length(NYab$id)),
closest_station=rep(0,length(NYab$id)),
station_ID=rep(0,length(NYab$id))
)
NYsub <- NYsub %>% mutate(No_lines=str_count(NYsub$LINE,"[^-]+"))
summary(NYsub)
## OBJECTID NAME LINE NOTES
## Min. : 1.0 Length:473 Length:473 Length:473
## 1st Qu.:119.0 Class :character Class :character Class :character
## Median :237.0 Mode :character Mode :character Mode :character
## Mean :238.1
## 3rd Qu.:355.0
## Max. :643.0
## lat lon No_lines
## Min. :40.58 Min. :-74.03 Min. :1.000
## 1st Qu.:40.68 1st Qu.:-73.98 1st Qu.:1.000
## Median :40.72 Median :-73.95 Median :2.000
## Mean :40.73 Mean :-73.94 Mean :1.877
## 3rd Qu.:40.78 3rd Qu.:-73.90 3rd Qu.:2.000
## Max. :40.90 Max. :-73.76 Max. :5.000
To keep the analysis accurate, extreme values and unused columns are removed: only listings priced between the first and third quartiles are kept, and Staten Island is excluded.
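As a minimal sketch of the quartile filter described here, the snippet below applies the same rule to a small made-up price vector. It uses `quantile()` directly rather than indexing into `summary()`. The values and the names `prices`, `q` and `kept` are illustrative assumptions.

```r
# Toy prices (made up), including one extreme value
prices <- c(10, 50, 69, 80, 106, 150, 175, 300, 10000)

# First and third quartiles
q <- quantile(prices, c(0.25, 0.75))

# Keep only prices inside the interquartile range, bounds included
kept <- prices[prices >= q[1] & prices <= q[2]]
```

This keeps roughly the middle half of the data, which is a stricter rule than the common convention of dropping points beyond 1.5 times the IQR from the quartiles.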
# Calculate closest subway station
for (i in 1:length(NYab$id)){
dis_temp <- rep(0,length(NYsub$OBJECTID))
for (j in 1:length(NYsub$OBJECTID)){
dis_temp[j]=distance(ang2rad(NYab$latitude[i]),ang2rad(NYab$longitude[i]),ang2rad(NYsub$lat[j]),ang2rad(NYsub$lon[j]))
}
NYab$station_dis[i]=min(dis_temp)
NYab$closest_station[i]=NYsub$NAME[which.min(dis_temp)]
NYab$station_ID[i]=NYsub$OBJECTID[which.min(dis_temp)]
}
(sum_dis <- summary(NYab$station_dis))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002372 0.204155 0.314456 0.411766 0.460448 8.063417
The distance from each hostel to its closest subway station is calculated with the loop above. As the summary shows, there are outliers here as well: the greatest distance from a hostel to its closest station is 8.063417 km.
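The `distance` and `ang2rad` helpers called in the loop are not shown in this section. A minimal sketch of what they might look like follows, assuming a haversine great-circle distance in kilometres on a sphere of radius 6371 km. Both the formula and the radius are assumptions, not the original definitions.

```r
# Hypothetical helpers; the original definitions are not shown.

# Degrees to radians
ang2rad <- function(deg) deg * pi / 180

# Haversine distance in km; all four inputs are in radians
distance <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  dlat <- lat2 - lat1
  dlon <- lon2 - lon1
  a <- sin(dlat / 2)^2 + cos(lat1) * cos(lat2) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Astor Pl to Canal St, coordinates taken from the head() output (about 1.5 km)
d <- distance(ang2rad(40.73005), ang2rad(-73.99107),
              ang2rad(40.71880), ang2rad(-74.00019))
```

Because this version is vectorised, the inner loop over stations could be replaced by one call per listing that passes `ang2rad(NYsub$lat)` and `ang2rad(NYsub$lon)` as the second pair of arguments and then takes `min()` and `which.min()` of the result.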
# Remove data (distance > Q3)
NYab_dis <- NYab %>% filter(station_dis<=sum_dis[5])
# Visualization
ggplot(NYab_dis,mapping=aes(station_dis)) +
geom_histogram(binwidth = 0.02)+
scale_color_grey() +
theme_classic() +
theme(legend.position="top")+
ggtitle("Distance from hostels to \nits closest subway station ")
Since the distances also contain extreme values, hostels farther from their closest station than the third quartile (about 0.46 km) are removed to keep the analysis accurate.
# Scatter plot by room type (price vs distance from subway)
price_dis_rmtp <- NYab_dis %>% ggplot(mapping=aes(y=station_dis,x=price,color=room_type))+
geom_point()+
geom_smooth(se=F)+
ggtitle("Scatter Plot of price (by room type) \nand distance from closest subway station") +
xlab("Price") + ylab("Distance(Km)")
price_dis_rmtp
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
As shown in the scatter plot, there is no obvious relationship between a hostel's price and the distance to its closest station.
# Calculate number of hostel near each subway
nb_bnb <- NYab_dis %>% group_by(closest_station) %>% summarise(n=n())
## `summarise()` ungrouping output (override with `.groups` argument)
nb_bnb <- rename(nb_bnb,NAME=closest_station)
NYsub <- NYsub %>% left_join(nb_bnb,by="NAME")
NYsub <- NYsub %>% rename(No_bnb=n)
# Sort subway station by no of nearby bnb
NYsub_bnb <- NYsub %>% arrange(desc(No_bnb))
# Bubble on map station by no of bnb
mybins <- seq(0, 500, by=100)
palette_rev <- rev(brewer.pal(5, "Spectral"))
mypalette <- colorBin( palette=palette_rev,
domain=NYsub$No_bnb, na.color="transparent", bins=mybins)
sub_bb_map <- NYsub %>% leaflet() %>%
setView(lng = -73.9, lat = 40.73, zoom = 11) %>%
addProviderTiles(providers$Esri.OceanBasemap) %>%
addCircleMarkers(~lon, ~lat,
fillColor = ~mypalette(No_bnb), fillOpacity = 0.7, color="white", radius=3, stroke=FALSE
) %>%
addLegend( pal=mypalette, values=~No_bnb, opacity=0.9, title = "No. of nearby hostels<br>by stations", position = "bottomright" )
sub_bb_map
As the map above shows, many hostels are located close to subway stations in Manhattan. We therefore conclude that the most crowded area is around Central Park.
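One caveat of counting by station name: several NYC subway stations share the same name (for example, more than one station is called Canal St), so joining on `NAME` can assign one count to multiple stations. A minimal sketch of counting and joining on the station ID instead is shown below, with made-up data. The toy values and the names `stations`, `nearest_id` and `counts` are assumptions for illustration.

```r
# Toy station table (made up): two different stations share a name
stations <- data.frame(
  OBJECTID = c(1, 2, 7),
  NAME     = c("Canal St", "Canal St", "Astor Pl"),
  stringsAsFactors = FALSE
)

# Toy nearest-station IDs for six listings, as stored in station_ID
nearest_id <- c(1, 1, 2, 7, 7, 7)

# Count listings per station ID
counts <- aggregate(No_bnb ~ OBJECTID,
                    data = data.frame(OBJECTID = nearest_id, No_bnb = 1),
                    FUN = sum)

# Join on the unique ID so same-named stations keep separate counts
stations <- merge(stations, counts, by = "OBJECTID", all.x = TRUE)
```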
## word freq
## room room 10041
## bedroom bedroom 8214
## private private 7313
## apartment apartment 6585
## cozy cozy 5051
## apt apt 4068
## studio studio 4059
## brooklyn brooklyn 4022
## the the 3882
## spacious spacious 3769
## manhattan manhattan 3385
## with with 3099
## park park 3086
## east east 3074
## sunny sunny 2921
## and and 2870
## williamsburg williamsburg 2677
## beautiful beautiful 2503
## near near 2346
## village village 2297
Surprisingly, hostel names containing the words “private”, “cozy”, or “sunny” are associated with a lower median price, whereas the words “spacious” and “beautiful” show no obvious association with the median price.
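The code for the word-versus-price comparison is not shown. A minimal sketch of one way to compare the median price of listings whose name does or does not contain a given word follows, using made-up data. The function name `median_by_word` and the toy listings are hypothetical.

```r
# Toy listings (made up) with the two columns the comparison needs
listings <- data.frame(
  name  = c("Cozy private room", "Spacious loft", "Sunny studio",
            "Cozy apt near park", "Beautiful brownstone"),
  price = c(60, 200, 90, 80, 180),
  stringsAsFactors = FALSE
)

# Median price of listings with and without `word` in the name.
# Matching is case-insensitive and on substrings, not whole words.
median_by_word <- function(word, data) {
  has_word <- grepl(word, tolower(data$name), fixed = TRUE)
  c(with_word    = median(data$price[has_word]),
    without_word = median(data$price[!has_word]))
}

res <- median_by_word("cozy", listings)
```

Substring matching is a simplification: "apt" would also match "adapted", so a word-boundary pattern may be preferable on real listing names.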
This comparison is inspired by the following study, which analyzes the relationship between how food is described on restaurant menus and other variables such as price: “Word Salad: Relating Food Prices and Descriptions” https://homes.cs.washington.edu/~nasmith/papers/chahuneau+gimpel+routledge+scherlis+smith.emnlp12.pdf
To sum up, price and location influence the market to a certain extent, and other secondary factors may also play a part. Although the distance from each hostel to the nearest subway station was added as an extra criterion for analyzing the market, the results showed no strong relationship between that distance and price. More importantly, hostels with zero availability generally have lower prices and fewer reviews. A possible reason is that most of those rooms are private rooms, so their prices are the lowest and listing them on Airbnb, which charges a commission as the broker, is not profitable. This may lead to the popularity of those hostels being underestimated.
Furthermore, some words in hostel names appear to be more appealing to customers at the booking stage. Choosing attractive wording for the hostel name may therefore be an important consideration for owners.
Marketing direction is likely to be driven by the room type offered, with the aim of a more lucrative business.
The analysis is limited by the lack of a direct occupancy-rate attribute, so the number of reviews was used as a proxy for it, which may introduce bias. In addition, the analysis relies on a single data source and does not take the time dimension into account, which limits its scope.